Incorporating peak grouping information for alignment of multiple liquid chromatography-mass spectrometry datasets
Motivation: The combination of liquid chromatography and mass spectrometry (LC/MS) has been widely used for large-scale comparative studies in systems biology, including proteomics, glycomics and metabolomics. In almost all experimental designs, it is necessary to compare chromatograms across biological or technical replicates and across sample groups. Central to this is the peak alignment step, which is one of the most important but challenging preprocessing steps. Existing alignment tools do not take into account the structural dependencies between related peaks that co-elute and are derived from the same metabolite or peptide. We propose a direct matching peak alignment method for LC/MS data that incorporates related-peak information (within each LC/MS run) and investigate its effect on alignment performance (across runs). The groupings of related peaks necessary for our method can be obtained from any peak clustering method and are built into a pairwise peak similarity score function. The resulting similarity score matrix is used by an approximation algorithm for the weighted matching problem to produce the actual alignment result.
Results: We demonstrate that related peak information can improve alignment performance. The performance is evaluated on a set of benchmark datasets, where our method performs competitively compared to other popular alignment tools.
Availability: The proposed alignment method has been implemented as a stand-alone application in Python, available for download at http://github.com/joewandy/peak-grouping-alignment.
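The scoring-then-matching idea in this abstract can be illustrated with a minimal Python sketch. It is not the published implementation: the score function, the tolerances, the weight `alpha`, the `group_support` term and the greedy matcher (one simple approximation to maximum weighted matching) are all assumptions chosen for illustration, and the peak values are toy data.

```python
def base_score(p, q, mz_tol=0.01, rt_tol=30.0):
    """Similarity in [0, 1] from m/z and RT closeness; 0 outside tolerance."""
    d_mz = abs(p["mz"] - q["mz"])
    d_rt = abs(p["rt"] - q["rt"])
    if d_mz > mz_tol or d_rt > rt_tol:
        return 0.0
    return 0.5 * (1 - d_mz / mz_tol) + 0.5 * (1 - d_rt / rt_tol)


def group_support(p, q, run_a, run_b):
    """Fraction of p's group-mates with a plausible partner in q's group."""
    mates_a = [x for x in run_a if x["group"] == p["group"] and x is not p]
    mates_b = [y for y in run_b if y["group"] == q["group"] and y is not q]
    if not mates_a:
        return 0.0
    hits = sum(any(base_score(x, y) > 0 for y in mates_b) for x in mates_a)
    return hits / len(mates_a)


def align(run_a, run_b, alpha=0.3):
    """Greedy approximate weighted matching on the combined score."""
    scored = []
    for i, p in enumerate(run_a):
        for j, q in enumerate(run_b):
            s = base_score(p, q)
            if s > 0:
                # Blend peak-level similarity with support from related peaks.
                s = (1 - alpha) * s + alpha * group_support(p, q, run_a, run_b)
                scored.append((s, i, j))
    scored.sort(reverse=True)
    used_a, used_b, matches = set(), set(), []
    for s, i, j in scored:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((i, j))
    return sorted(matches)


# Toy data: group ids come from some within-run peak clustering (hypothetical).
run_a = [{"mz": 100.000, "rt": 100.0, "group": 0},
         {"mz": 122.000, "rt": 100.0, "group": 0}]
run_b = [{"mz": 100.000, "rt": 112.0, "group": 0},
         {"mz": 122.000, "rt": 112.0, "group": 0},
         {"mz": 100.001, "rt": 105.0, "group": 1}]  # distractor, closer in RT

with_groups = align(run_a, run_b, alpha=0.3)
without_groups = align(run_a, run_b, alpha=0.0)
```

In this toy case the distractor peak wins when only m/z and RT are scored, whereas the grouping term pulls the first peak towards the partner whose co-eluting group-mate also has a match.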
Unsupervised Bayesian explorations of mass spectrometry data
In recent years, the large-scale, untargeted studies of the compounds that serve as workers in the cell (proteins) and the small molecules involved in essential life-sustaining chemical processes (metabolites) have provided insights into a wide array of fields, such as medical diagnostics, drug discovery, personalised medicine and many others. Measurements in such studies are routinely performed using liquid chromatography mass spectrometry (LC-MS) instruments. From these measurements, we obtain a set of peaks having mass-to-charge, retention time (RT) and intensity values. Before further analysis is possible, the raw LC-MS data has to be processed in a data pre-processing pipeline. In the alignment step of the pipeline, peaks from multiple LC-MS measurements have to be matched. In the identification step, the identities of the unknown compounds in the sample that generate the observed peaks have to be assigned. Using tandem mass spectrometry, fragmentation peaks characteristic of a compound can be obtained and used to help establish the identity of the compound. Alignment and identification are challenging because the true identities of the entire set of compounds in the sample are unknown, and a single compound can produce many observed peaks, each with a potential drift in its retention time value. These observed peaks are not independent as they can be explained as being generated by the same compound.
The aim of this thesis is to introduce methods to group these related peaks and to use these groupings to improve alignment and assist in identification during data pre-processing. Firstly, we introduce a generative model to group related peaks by their retention time. This information is used to influence direct-matching alignment, bringing related peak groups closer during matching. Investigations using benchmark datasets reveal that improved alignment performance is obtained from this approach. Next, we also consider mass information in the grouping process, resulting in PrecursorCluster, a model that performs the grouping of related peaks in metabolomics by their explainable mass relationships, RT and intensity values. Through a second-stage process that matches these related peak groups, peak alignment is produced. Experiments on benchmark datasets show that an improved alignment performance is obtained, while uncertainties in matched peaksets can also be extracted from the method. In the next section, we expand upon this two-stage method and introduce HDPAlign, a model that performs the clustering of related peaks within and across multiple LC-MS runs at once. This allows for matched peaksets and their respective uncertainties to be naturally extracted from the model. Finally, we look at fragmentation peaks used for identification and introduce MS2LDA, a topic model to group related fragmentation features. These groups of related fragmentation features potentially correspond to substructures shared by metabolites and can be used to assist data interpretation during identification. This final section describes work in progress and points to many interesting avenues for future research.
MetAssign: probabilistic annotation of metabolites from LC–MS data using a Bayesian clustering approach
Motivation: The use of liquid chromatography coupled to mass spectrometry (LC–MS) has enabled the high-throughput profiling of the metabolite composition of biological samples. However, the large amount of data obtained can be difficult to analyse and often requires computational processing to understand which metabolites are present in a sample. This paper looks at the dual problem of annotating peaks in a sample with a metabolite, together with putatively annotating whether a metabolite is present in the sample. The starting point of the approach is a Bayesian clustering of peaks into groups, each corresponding to putative adducts and isotopes of a single metabolite.
Results: The Bayesian modelling introduced here combines information from the mass-to-charge ratio, retention time and intensity of each peak, together with a model of the inter-peak dependency structure, to increase the accuracy of peak annotation. The results inherently contain a quantitative estimate of confidence in the peak annotations and allow an accurate trade-off between precision and recall. Extensive validation experiments using authentic chemical standards show that this system is able to produce more accurate putative identifications than other state-of-the-art systems, while at the same time giving a probabilistic measure of confidence in the annotations.
Availability: The software has been implemented as part of the mzMatch metabolomics analysis pipeline, which is available for download at http://mzmatch.sourceforge.net/
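To make the precision and recall trade-off concrete, here is a small Python sketch of how probabilistic annotations could be thresholded. It does not use mzMatch or MetAssign code; the function name, the definition of recall as the fraction of correct annotations retained, and the toy probabilities are assumptions for illustration only.

```python
def precision_recall(annotations, threshold):
    """annotations: list of (posterior_probability, is_correct) pairs.

    Keeps annotations whose probability is at least `threshold` and
    reports precision and recall against the known-correct labels.
    """
    kept = [ok for prob, ok in annotations if prob >= threshold]
    total_correct = sum(ok for _, ok in annotations)
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_correct if total_correct else 0.0
    return precision, recall


# Toy values, not results from the paper.
toy = [(0.95, True), (0.90, True), (0.60, False), (0.55, True), (0.20, False)]
loose = precision_recall(toy, threshold=0.5)
strict = precision_recall(toy, threshold=0.8)
```

Raising the threshold in this toy example trades recall for precision, which is the kind of choice a per-annotation confidence estimate makes possible.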
GraphOmics: an interactive platform to explore and integrate multi-omics data
Background:
An increasing number of studies now produce multiple omics measurements that require using sophisticated computational methods for analysis. While each omics dataset can be examined separately, jointly integrating multiple omics datasets allows for deeper understanding and insights to be gained from the study. In particular, data integration can be performed horizontally, where biological entities from multiple omics measurements are mapped to common reactions and pathways. However, data integration remains a challenge due to the complexity of the data and the difficulty in interpreting analysis results.
Results:
Here we present GraphOmics, a user-friendly platform to explore and integrate multiple omics datasets and support hypothesis generation. Users can upload transcriptomics, proteomics and metabolomics data to GraphOmics. Relevant entities are connected based on their biochemical relationships, and mapped to reactions and pathways from Reactome. From the Data Browser in GraphOmics, mapped entities and pathways can be ranked, sorted and filtered according to their statistical significance (p values) and fold changes. Context-sensitive panels provide information on the currently selected entities, while interactive heatmaps and clustering functionalities are also available. As a case study, we demonstrated how GraphOmics was used to interactively explore multi-omics data and support hypothesis generation using two complex datasets from existing zebrafish regeneration and COVID-19 human studies.
Conclusions:
GraphOmics is fully open-sourced and freely accessible from https://graphomics.glasgowcompbio.org/. It can be used to integrate multiple omics data horizontally by mapping entities across omics to reactions and pathways. Our demonstration showed that by using interactive explorations from GraphOmics, interesting insights and biological hypotheses could be rapidly revealed.
Ms2lda.org: web-based topic modelling for substructure discovery in mass spectrometry
Motivation: We recently published MS2LDA, a method for the decomposition of sets of molecular fragment data derived from large metabolomics experiments. To make the method more widely available to the community, here we present ms2lda.org, a web application that allows users to upload their data, run MS2LDA analyses and explore the results through interactive visualisations.
Results: Ms2lda.org takes tandem mass spectrometry data in many standard formats and allows the user to infer the sets of fragment and neutral loss features that co-occur (Mass2Motifs). As an alternative workflow, the user can also decompose a dataset onto predefined Mass2Motifs. This is accomplished through the web interface or programmatically from our web service.
R package for statistical inference in dynamical systems using kernel based gradient matching: KGode
Many processes in science and engineering can be described by dynamical systems based on nonlinear ordinary differential equations (ODEs). Often ODE parameters are unknown and not directly measurable. Since nonlinear ODEs typically have no closed form solution, standard iterative inference procedures require a computationally expensive numerical integration of the ODEs every time the parameters are adapted, which in practice restricts statistical inference to rather small systems. To overcome this computational bottleneck, approximate methods based on gradient matching have recently gained much attention. The idea is to circumvent the numerical integration step by using a surrogate cost function that quantifies the discrepancy between the derivatives obtained from a smooth interpolant to the data and the derivatives predicted by the ODEs. The present article describes the software implementation of a recent method that is based on the framework of reproducing kernel Hilbert spaces. We provide an overview of the methods available, illustrate them on a series of widely used benchmark problems, and discuss the accuracy–efficiency trade-off of various regularization methods.
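The gradient matching idea can be shown on a deliberately simple case. The Python sketch below is not KGode (which is an R package built on reproducing kernel Hilbert spaces); it assumes a one-parameter linear ODE, dx/dt = -theta * x, noise-free data, and central finite differences as a crude stand-in for the derivatives of a smooth interpolant. All names are hypothetical.

```python
import math


def gradient_matching_theta(times, xs):
    """Estimate theta in dx/dt = -theta * x without integrating the ODE."""
    # Step 1: derivatives from the data (central differences at interior
    # points stand in for the derivative of a smooth interpolant).
    points = []
    for i in range(1, len(xs) - 1):
        dxdt = (xs[i + 1] - xs[i - 1]) / (times[i + 1] - times[i - 1])
        points.append((xs[i], dxdt))
    # Step 2: minimise the surrogate cost sum_i (dxdt_i + theta * x_i)^2.
    # The cost is quadratic in theta, so the minimiser has a closed form.
    numerator = -sum(x * d for x, d in points)
    denominator = sum(x * x for x, _ in points)
    return numerator / denominator


# Synthetic observations of x(t) = exp(-0.5 t); the true theta is 0.5.
times = [0.1 * k for k in range(51)]
xs = [math.exp(-0.5 * t) for t in times]
theta_hat = gradient_matching_theta(times, xs)
```

No numerical ODE solver is called at any point: the parameter is recovered purely by matching the data-derived slopes to the slopes the ODE predicts, which is what makes the approach cheap for larger systems.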
In silico optimization of mass spectrometry fragmentation strategies in metabolomics
Liquid chromatography (LC) coupled to tandem mass spectrometry (MS/MS) is widely used in identifying small molecules in untargeted metabolomics. Various strategies exist to acquire MS/MS fragmentation spectra; however, the development of new acquisition strategies is hampered by the lack of simulators that let researchers prototype, compare, and optimize strategies before validations on real machines. We introduce Virtual Metabolomics Mass Spectrometer (ViMMS), a metabolomics LC-MS/MS simulator framework that allows for scan-level control of the MS2 acquisition process in silico. ViMMS can generate new LC-MS/MS data based on empirical data or virtually re-run a previous LC-MS/MS analysis using pre-existing data to allow the testing of different fragmentation strategies. To demonstrate its utility, we show how ViMMS can be used to optimize N for Top-N data-dependent acquisition (DDA), giving results comparable to modifying N on the mass spectrometer. We expect that ViMMS will save method development time by allowing for offline evaluation of novel fragmentation strategies and optimization of the fragmentation strategy for a particular experiment.
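For readers unfamiliar with Top-N DDA, the following Python sketch shows the selection rule in its simplest form. It does not use the ViMMS API; the function names, the dynamic exclusion window, the m/z tolerance and the toy scans are all assumptions made for illustration.

```python
def top_n_targets(ms1_peaks, n, excluded, rt, mz_tol=0.01):
    """Pick up to n precursor m/z values from one MS1 scan, most intense first.

    `excluded` holds (mz, exclude_until_rt) pairs from earlier fragmentation.
    """
    targets = []
    for mz, _intensity in sorted(ms1_peaks, key=lambda p: p[1], reverse=True):
        if len(targets) == n:
            break
        still_excluded = any(
            abs(mz - ex_mz) <= mz_tol and rt < until for ex_mz, until in excluded
        )
        if not still_excluded:
            targets.append(mz)
    return targets


def simulate_top_n(scans, n, exclusion_window=15.0):
    """Run Top-N with dynamic exclusion over (rt, [(mz, intensity), ...]) scans."""
    excluded, schedule = [], []
    for rt, peaks in scans:
        chosen = top_n_targets(peaks, n, excluded, rt)
        schedule.append((rt, chosen))
        # Fragmented precursors are skipped until the window has elapsed.
        excluded.extend((mz, rt + exclusion_window) for mz in chosen)
    return schedule


# Toy MS1 scans: (retention time in seconds, [(m/z, intensity), ...]).
scans = [
    (0.0, [(100.0, 1e5), (200.0, 5e4), (300.0, 1e4)]),
    (5.0, [(100.0, 9e4), (200.0, 6e4), (300.0, 2e4)]),
    (20.0, [(100.0, 8e4), (300.0, 1e4)]),
]
coverage = {
    n: len({mz for _, chosen in simulate_top_n(scans, n) for mz in chosen})
    for n in (1, 2)
}
```

Comparing `coverage` across values of N is a toy version of the offline optimization described above: the same recorded scans are replayed under different settings and the number of distinct precursors fragmented is compared.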
Deciphering complex metabolite mixtures by unsupervised and supervised substructure discovery and semi-automated annotation from MS/MS spectra
Complex metabolite mixtures are challenging to unravel. Mass spectrometry (MS) is a widely
used and sensitive technique to obtain structural information on complex mixtures. However, just
knowing the molecular masses of the mixture’s constituents is almost always insufficient for
confident assignment of the associated chemical structures. Structural information can be
augmented through MS fragmentation experiments whereby detected metabolites are
fragmented giving rise to MS/MS spectra. However, how can we maximize the structural
information we gain from fragmentation spectra?
We recently proposed a substructure-based strategy to enhance metabolite annotation for
complex mixtures by considering metabolites as the sum of (bio)chemically relevant moieties that
we can detect through mass spectrometry fragmentation approaches. Our MS2LDA tool allows
us to discover - unsupervised - groups of mass fragments and/or neutral losses termed
Mass2Motifs that often correspond to substructures. After manual annotation, these Mass2Motifs
can be used in subsequent MS2LDA analyses of new datasets, thereby providing structural
annotations for many molecules that are not present in spectral databases.
Here, we describe how additional strategies, taking advantage of i) combinatorial in-silico
matching of experimental mass features to substructures of candidate molecules, and ii)
automated machine learning classification of molecules, can facilitate semi-automated annotation
of substructures. We show how our approach accelerates the Mass2Motif annotation process and
therefore broadens the chemical space spanned by characterized motifs. Our machine learning
model used to classify fragmentation spectra learns the relationships between fragment spectra
and chemical features. Classification prediction on these features can be aggregated for all
molecules that contribute to a particular Mass2Motif and guide Mass2Motif annotations.
To make annotated Mass2Motifs available to the community, we also present motifDB: an open
database of Mass2Motifs that can be browsed and accessed programmatically through an
Application Programming Interface (API). MotifDB is integrated within ms2lda.org, allowing users
to efficiently search for characterized motifs in their own experiments. We expect that with an
increasing number of Mass2Motif annotations available through a growing database we can more
quickly gain insight into the constituents of complex mixtures. That will allow prioritization towards
novel or unexpected chemistries and faster recognition of known biochemical building blocks.
Topic modeling for untargeted substructure exploration in metabolomics
The potential of untargeted metabolomics to answer important questions across the life
sciences is hindered by a paucity of computational tools that enable extraction of key biochemically
relevant information. Available tools focus on using mass spectrometry fragmentation
spectra to identify molecules whose behavior suggests they are relevant to the system
under study. Unfortunately, fragmentation spectra cannot identify molecules in isolation,
but require authentic standards or databases of known fragmented molecules. Fragmentation
spectra are, however, replete with information pertaining to the biochemical processes
present; much of which is currently neglected. Here we present an analytical workflow that
exploits all fragmentation data from a given experiment to extract biochemically-relevant
features in an unsupervised manner. We demonstrate that an algorithm originally utilized for
text-mining, Latent Dirichlet Allocation, can be adapted to handle metabolomics datasets.
Our approach extracts biochemically-relevant molecular substructures (‘Mass2Motifs’) from
spectra as sets of co-occurring molecular fragments and neutral losses. The analysis allows
us to isolate molecular substructures, whose presence allows molecules to be grouped
based on shared substructures regardless of classical spectral similarity. These substructures
in turn support putative de novo structural annotation of molecules. Combining this
spectral connectivity with orthogonal correlations (e.g. common abundance changes under
system perturbation) significantly enhances our ability to provide mechanistic explanations
for biological behavior.
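The text-mining analogy treats each fragmentation spectrum as a document whose words are fragments and neutral losses. The Python sketch below shows one way such documents could be built before a topic model is fitted. It is an illustrative assumption, not the published pipeline: the two-decimal binning, the `min_loss` cutoff, the use of scaled intensities as word counts and the toy spectra are all hypothetical.

```python
from collections import Counter


def spectrum_to_words(precursor_mz, fragments, min_loss=10.0):
    """Turn one MS/MS spectrum into a bag of fragment and neutral-loss words.

    `fragments` is a list of (mz, intensity) pairs. Counts are intensities
    rescaled so the base peak contributes 100 (an assumption of this sketch).
    """
    base = max(intensity for _, intensity in fragments)
    words = Counter()
    for mz, intensity in fragments:
        count = int(round(100 * intensity / base))
        words[f"fragment_{mz:.2f}"] += count
        loss = precursor_mz - mz
        if loss >= min_loss:
            # The neutral loss is the mass difference to the precursor.
            words[f"loss_{loss:.2f}"] += count
    return words


# Two toy spectra with different fragments but the same neutral loss.
doc_a = spectrum_to_words(180.06, [(163.04, 50.0), (119.05, 100.0)])
doc_b = spectrum_to_words(150.05, [(133.03, 80.0)])
shared = set(doc_a) & set(doc_b)
```

Here the two toy spectra share no fragment peaks, yet both contain the word `loss_17.02`. Shared words of this kind are what allow a topic model to group molecules by common substructure even when classical spectral similarity is low.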
PiMP my metabolome: An integrated, web-based tool for LC-MS metabolomics data
Summary: The Polyomics integrated Metabolomics Pipeline (PiMP) fulfils an unmet need in metabolomics
data analysis. PiMP offers automated and user-friendly analysis from mass spectrometry data
acquisition to biological interpretation. Our key innovations are the Summary Page, which provides a
simple overview of the experiment in the format of a scientific paper, containing the key findings of
the experiment along with associated metadata; and the Metabolite Page, which provides a list of
each metabolite accompanied by ‘evidence cards’, which provide a variety of criteria behind metabolite
annotation including peak shapes, intensities in different sample groups and database information.
Availability: PiMP is available at http://polyomics.mvls.gla.ac.uk, and access is freely available on
request. 50 GB of space is allocated for data storage, with unrestricted number of samples and analyses
per user. Source code is available at https://github.com/RonanDaly/pimp and licensed under the
GPL.